R4DS 09 - Relational Data with dplyr
The codes below are from the practice exercises in https://r4ds.had.co.nz/, and are taken with reference from: https://jrnold.github.io/r4ds-exercise-solutions/
Loading tidyverse package.
Typically, there are many tables of data, and you must combine them to answer the questions you are interested in. Collectively, mutiple tables of data are called relational data, becase you are interested in the relations and not just the individual datasets.
airlines: look up the full carrier name from its abbreviated code
airlines
# A tibble: 16 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airports: gives information about each airport, identified by the faa airport code
airports
# A tibble: 1,458 x 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne…
2 06A Moton Field Munic… 32.5 -85.7 264 -6 A America/Ch…
3 06C Schaumburg Region… 42.0 -88.1 801 -6 A America/Ch…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne…
5 09J Jekyll Island Air… 31.1 -81.4 11 -5 A America/Ne…
6 0A9 Elizabethton Muni… 36.4 -82.2 1593 -5 A America/Ne…
7 0G6 Williams County A… 41.5 -84.5 730 -5 A America/Ne…
8 0G7 Finger Lakes Regi… 42.9 -76.8 492 -5 A America/Ne…
9 0P2 Shoestring Aviati… 39.8 -76.6 1000 -5 U America/Ne…
10 0S9 Jefferson County … 48.1 -123. 108 -8 A America/Lo…
# … with 1,448 more rows
planes: gives information about each plane, identified by its tail number
planes
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed … EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed … EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed … AIRBUS INDU… A320… 2 182 NA Turbo…
# … with 3,312 more rows
weather: gives the weather at each NYC airport for each hour
weather
# A tibble: 26,115 x 15
origin year month day hour temp dewp humid wind_dir
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270
2 EWR 2013 1 1 2 39.0 27.0 61.6 250
3 EWR 2013 1 1 3 39.0 28.0 64.4 240
4 EWR 2013 1 1 4 39.9 28.0 62.2 250
5 EWR 2013 1 1 5 39.0 28.0 64.4 260
6 EWR 2013 1 1 6 37.9 28.0 67.2 240
7 EWR 2013 1 1 7 39.0 28.0 64.4 240
8 EWR 2013 1 1 8 39.9 28.0 62.2 250
9 EWR 2013 1 1 9 39.9 28.0 62.2 260
10 EWR 2013 1 1 10 41 28.0 59.6 260
# … with 26,105 more rows, and 6 more variables: wind_speed <dbl>,
# wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>,
# time_hour <dttm>
Imagine that you want to draw the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?
# require the latitude and longitude of the origin and destination airports of each flight.
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
glimpse(airports) # has lat, lon
Rows: 1,458
Columns: 8
$ faa <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7"…
$ name <chr> "Lansdowne Airport", "Moton Field Municipal Airport", …
$ lat <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.3…
$ lon <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778,…
$ alt <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 40…
$ tz <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5…
$ dst <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A",…
$ tzone <chr> "America/New_York", "America/Chicago", "America/Chicag…
flights_latlon <- flights %>%
inner_join(select(airports, origin = faa,
origin_lat = lat,
origin_lon = lon),
by = "origin") %>%
inner_join(select(airports, dest = faa,
dest_lat = lat,
dest_lon = lon),
by = "dest")
glimpse(flights_latlon)
Rows: 329,174
Columns: 23
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 554, 554, 555, 557, 557, 558, …
$ sched_dep_time <int> 515, 529, 540, 600, 558, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -6, -4, -5, -3, -3, -2, -2, -2, -2, …
$ arr_time <int> 830, 850, 923, 812, 740, 913, 709, 838, 753, …
$ sched_arr_time <int> 819, 830, 850, 837, 728, 854, 723, 846, 745, …
$ arr_delay <dbl> 11, 20, 33, -25, 12, 19, -14, -8, 8, -2, -3, …
$ carrier <chr> "UA", "UA", "AA", "DL", "UA", "B6", "EV", "B6…
$ flight <int> 1545, 1714, 1141, 461, 1696, 507, 5708, 79, 3…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "LGA", "EWR", "EWR", "LG…
$ dest <chr> "IAH", "IAH", "MIA", "ATL", "ORD", "FLL", "IA…
$ air_time <dbl> 227, 227, 160, 116, 150, 158, 53, 140, 138, 1…
$ distance <dbl> 1400, 1416, 1089, 762, 719, 1065, 229, 944, 7…
$ hour <dbl> 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, …
$ minute <dbl> 15, 29, 40, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
$ origin_lat <dbl> 40.69250, 40.77725, 40.63975, 40.77725, 40.69…
$ origin_lon <dbl> -74.16867, -73.87261, -73.77893, -73.87261, -…
$ dest_lat <dbl> 29.98443, 29.98443, 25.79325, 33.63672, 41.97…
$ dest_lon <dbl> -95.34144, -95.34144, -80.29056, -84.42807, -…
# first 100 flights
flights_latlon %>%
slice(1:100) %>%
ggplot(aes(
x = origin_lon, xend = dest_lon,
y = origin_lat, yend = dest_lat
)) +
borders("state") +
geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
coord_quickmap() +
labs( y = "Lat",
x = "Lon")
Add a surrogate key to flights
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
flights %>%
mutate(flight_id = row_number()) %>%
glimpse()
Rows: 336,776
Columns: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
$ flight_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
Compute the average delay by destination, then join on the airports data frame so that you can show the spatial distribution of delays.
Add the location of the origin and destination to flights
airport_locations <- airports %>%
select(faa, lat, lon)
flights %>%
select(year:day, hour, origin, dest) %>%
left_join(airport_locations, by = c("origin" = "faa")) %>%
left_join(airport_locations, by = c("dest" = "faa"),
suffix = c("_origin", "_dest"))
# A tibble: 336,776 x 10
year month day hour origin dest lat_origin lon_origin lat_dest
<int> <int> <int> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH 40.7 -74.2 30.0
2 2013 1 1 5 LGA IAH 40.8 -73.9 30.0
3 2013 1 1 5 JFK MIA 40.6 -73.8 25.8
4 2013 1 1 5 JFK BQN 40.6 -73.8 NA
5 2013 1 1 6 LGA ATL 40.8 -73.9 33.6
6 2013 1 1 5 EWR ORD 40.7 -74.2 42.0
7 2013 1 1 6 EWR FLL 40.7 -74.2 26.1
8 2013 1 1 6 LGA IAD 40.8 -73.9 38.9
9 2013 1 1 6 JFK MCO 40.6 -73.8 28.4
10 2013 1 1 6 LGA ORD 40.8 -73.9 42.0
# … with 336,766 more rows, and 1 more variable: lon_dest <dbl>
Is there a relationship between the age of a plane and its delays?
# merge flights with planes (which contains plane years)
# calculate the average departure delay for each age of flight
plane_cohorts_dep <- inner_join(flights,
select(planes, tailnum, plane_year = year),
by = "tailnum") %>%
mutate(age = year - plane_year) %>%
filter(!is.na(age)) %>%
mutate(age = if_else(age>25, 25L, age)) %>%
group_by(age) %>%
summarise(dep_delay_mean = mean(dep_delay, na.rm = T),
dep_delay_sd = sd(dep_delay, na.rm = T),
n_dep_delay = sum(!is.na(dep_delay))) %>%
ggplot(aes(x = age, y = dep_delay_mean)) +
geom_point() +
scale_x_continuous("Age of plane(years)", breaks = seq(0,30, by = 10)) +
scale_y_continuous("Mean Dep Delays (min)") +
labs(subtitle = "Departure delay increases with age of plane until 10 years, then it declines and flattens out.",
title = "Relationship between Departure Delay and Age of Plane") +
theme_classic()
plane_cohorts_arr <- inner_join(flights,
select(planes, tailnum, plane_year = year),
by = "tailnum") %>%
mutate(age = year - plane_year) %>%
filter(!is.na(age)) %>%
mutate(age = if_else(age>25, 25L, age)) %>%
group_by(age) %>%
summarise(
arr_delay_mean = mean(arr_delay, na.rm = T),
arr_delay_sd = sd(arr_delay, na.rm = T),
n_arr_delay = sum(!is.na(arr_delay))) %>%
ggplot(aes(x = age, y = arr_delay_mean)) +
geom_point() +
scale_x_continuous("Age of plane(years)", breaks = seq(0,30, by = 10)) +
scale_y_continuous("Mean Arr Delays (min)") +
labs(subtitle = "Arr delay increases with age of plane until 10 years, then it declines and flattens out.",
title = "Relationship between Arr Delay and Age of Plane") +
theme_classic()
gridExtra::grid.arrange(plane_cohorts_dep, plane_cohorts_arr, nrow = 2)
What weather conditions make it more likely to see a delay?
flight_weather <- flights %>%
inner_join(weather, by = c(
"origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour"
))
flight_weather %>%
mutate(visib_cat = cut_interval(visib, n = 10)) %>%
group_by(visib_cat) %>%
summarise(dep_delay = mean(dep_delay, na.rm = T)) %>%
ggplot(aes(x = visib_cat, y = dep_delay)) +
geom_point() +
labs(title = "Relationship beween visibility and delay times",
subtitle = "A decrease in visibility increases delay timings") +
theme_classic()
Filter flights to only show flights with planes that have flown at least 100 flights.
flights_gte100 <- flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
count() %>%
filter(n >=100)
flights %>%
semi_join(flights_gte100, by = "tailnum")
# A tibble: 228,390 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 544 545 -1 1004
4 2013 1 1 554 558 -4 740
5 2013 1 1 555 600 -5 913
6 2013 1 1 557 600 -3 709
7 2013 1 1 557 600 -3 838
8 2013 1 1 558 600 -2 849
9 2013 1 1 558 600 -2 853
10 2013 1 1 558 600 -2 923
# … with 228,380 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Combine fueleconomy::vehicles and fueleconomy::common to find only records for the most common models
# A tibble: 14,531 x 12
id make model year class trans drive cyl displ fuel hwy
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1833 Acura Integ… 1986 Subco… Auto… Fron… 4 1.6 Regu… 28
2 1834 Acura Integ… 1986 Subco… Manu… Fron… 4 1.6 Regu… 28
3 3037 Acura Integ… 1987 Subco… Auto… Fron… 4 1.6 Regu… 28
4 3038 Acura Integ… 1987 Subco… Manu… Fron… 4 1.6 Regu… 28
5 4183 Acura Integ… 1988 Subco… Auto… Fron… 4 1.6 Regu… 27
6 4184 Acura Integ… 1988 Subco… Manu… Fron… 4 1.6 Regu… 28
7 5303 Acura Integ… 1989 Subco… Auto… Fron… 4 1.6 Regu… 27
8 5304 Acura Integ… 1989 Subco… Manu… Fron… 4 1.6 Regu… 28
9 6442 Acura Integ… 1990 Subco… Auto… Fron… 4 1.8 Regu… 24
10 6443 Acura Integ… 1990 Subco… Manu… Fron… 4 1.8 Regu… 26
# … with 14,521 more rows, and 1 more variable: cty <dbl>
https://jrnold.github.io/r4ds-exercise-solutions/
For attribution, please cite this work as
lruolin (2021, May 17). pRactice corner: Relational Data. Retrieved from https://lruolin.github.io/myBlog/posts/20210517_Tidyverse Chap 10 - Relational Data/
BibTeX citation
@misc{lruolin2021relational, author = {lruolin, }, title = {pRactice corner: Relational Data}, url = {https://lruolin.github.io/myBlog/posts/20210517_Tidyverse Chap 10 - Relational Data/}, year = {2021} }